{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "42b5e80b-ad1d-4335-a1f7-10a91127e3dc"
    }
   },
   "source": [
    "# Time Series Forecasting with Linear Learner\n",
    "_**Using Linear Regression to Forecast Monthly Demand**_\n",
    "\n",
    "---\n",
    "\n",
    "---\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [Background](#Background)\n",
    "1. [Setup](#Setup)\n",
    "1. [Data](#Data)\n",
    "1. [Train](#Train)\n",
    "1. [Host](#Host)\n",
    "  1. [Forecast](#Forecast)\n",
    "1. [Extensions](#Extensions)\n",
    "\n",
    "---\n",
    "\n",
    "## Background\n",
    "\n",
    "Forecasting is potentially the most broadly relevant machine learning topic there is.  Whether predicting future sales in retail, housing prices in real estate, traffic in cities, or patient visits in healthcare, almost every industry could benefit from improvements in their forecasts.  There are numerous statistical methodologies that have been developed to forecast time-series data, but still, the process for developing forecasts tends to be a mix of objective statistics and subjective interpretations.\n",
    "\n",
    "Properly modeling time-series data takes a great deal of care.  What's the right level of aggregation to model at?  Too granular and the signal gets lost in the noise, too aggregate and important variation is missed.  Also, what is the right cyclicality?  Daily, weekly, monthly?  Are there holiday peaks?  How should we weight recent versus overall trends?\n",
    "\n",
    "Linear regression with appropriate controls for trend, seasonality, and recent behavior, remains a common method for forecasting stable time-series with reasonable volatility.  This notebook will build a linear model to forecast weekly output for US gasoline products starting in 1991 to 2005.  It will focus almost exclusively on the application.  For a more in-depth treatment on forecasting in general, see [Forecasting: Principles & Practice](https://robjhyndman.com/uwafiles/fpp-notes.pdf).  In addition, because our dataset is a single time-series, we'll stick with SageMaker's Linear Learner algorithm.  If we had multiple, related time-series, we would use SageMaker's DeepAR algorithm, which is specifically designed for forecasting.  See the [DeepAR Notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/deepar_synthetic/deepar_synthetic.ipynb) for more detail.\n",
    "\n",
    "---\n",
    "\n",
    "## Setup\n",
    "\n",
    "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.\n",
    "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "bucket = '<your_s3_bucket_name_here>'\n",
    "prefix = 'sagemaker/DEMO-linear-time-series-forecast'\n",
    " \n",
    "# Define IAM role\n",
    "import boto3\n",
    "import re\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "role = get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "b2548d66-6f8f-426f-9cda-7a3cd1459abd"
    }
   },
   "source": [
    "Now we'll import the Python libraries we'll need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "bb88eea9-27f3-4e47-9133-663911ea09a9"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import io\n",
    "import os\n",
    "import time\n",
    "import json\n",
    "import sagemaker.amazon.common as smac\n",
    "import sagemaker\n",
    "from sagemaker.predictor import csv_serializer, json_deserializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "142777ae-c072-448e-b941-72bc75735d01"
    }
   },
   "source": [
    "---\n",
    "## Data\n",
    "\n",
    "Let's download the data.  More information about this dataset can be found [here](https://rdrr.io/github/robjhyndman/fpp/man/gasoline.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "78105bc7-ce5d-4003-84f6-4dc5700c5945"
    }
   },
   "outputs": [],
   "source": [
    "!wget http://robjhyndman.com/data/gasoline.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "b472326f-3584-4b61-aecc-04b35486a1ab"
    }
   },
   "source": [
    "And take a look at it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "f8976dad-6897-4c7e-8c95-ae2f53070ef5"
    }
   },
   "outputs": [],
   "source": [
    "gas = pd.read_csv('gasoline.csv', header=None, names=['thousands_barrels'])\n",
    "display(gas.head())\n",
    "plt.plot(gas)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "1c44e72e-1b0d-4dcb-91b9-9b9f28a697b0"
    }
   },
   "source": [
    "As we can see, there's a definitive upward trend, some yearly seasonality, but sufficient volatility to make the problem non-trivial.  There are several unexpected dips and years with more or less pronounced seasonality.  These same characteristics are common in many topline time-series.\n",
    "\n",
    "Next we'll transform the dataset to make it look a bit more like a standard prediction model.  Our target variable is `thousands_barrels`.  Let's create explanatory features, like:\n",
    "- `thousands_barrels` for each of the 4 preceeding weeks.\n",
    "- Trend.  The chart above suggests the trend is simply linear, but we'll create log and quadratic trends in case.\n",
    "- Indicator variables {0 or 1} that will help capture seasonality and key holiday weeks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "6af8d66e-2ef6-4e8d-bb23-d2bd3dbb0b20"
    }
   },
   "outputs": [],
   "source": [
    "gas['thousands_barrels_lag1'] = gas['thousands_barrels'].shift(1)\n",
    "gas['thousands_barrels_lag2'] = gas['thousands_barrels'].shift(2)\n",
    "gas['thousands_barrels_lag3'] = gas['thousands_barrels'].shift(3)\n",
    "gas['thousands_barrels_lag4'] = gas['thousands_barrels'].shift(4)\n",
    "gas['trend'] = np.arange(len(gas))\n",
    "gas['log_trend'] = np.log1p(np.arange(len(gas)))\n",
    "gas['sq_trend'] = np.arange(len(gas)) ** 2\n",
    "weeks = pd.get_dummies(np.array(list(range(52)) * 15)[:len(gas)], prefix='week')\n",
    "gas = pd.concat([gas, weeks], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "1c77ea86-256b-4601-a5d5-2f875c0649c9"
    }
   },
   "source": [
    "Now, we'll:\n",
    "- Clear out the first four rows where we don't have lagged information.\n",
    "- Split the target off from the explanatory features.\n",
    "- Split the data into training, validation, and test groups so that we can tune our model and then evaluate its accuracy on data it hasn't seen yet.  Since this is time-series data, we'll use the first 60% for training, the second 20% for validation, and the final 20% for final test evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "nbpresent": {
     "id": "80c0adca-5db2-4152-a9f4-42cbc1dbde84"
    }
   },
   "outputs": [],
   "source": [
    "gas = gas.iloc[4:, ]\n",
    "split_train = int(len(gas) * 0.6)\n",
    "split_test = int(len(gas) * 0.8)\n",
    "\n",
    "train_y = gas['thousands_barrels'][:split_train]\n",
    "train_X = gas.drop('thousands_barrels', axis=1).iloc[:split_train, ].as_matrix()\n",
    "validation_y = gas['thousands_barrels'][split_train:split_test]\n",
    "validation_X = gas.drop('thousands_barrels', axis=1).iloc[split_train:split_test, ].as_matrix()\n",
    "test_y = gas['thousands_barrels'][split_test:]\n",
    "test_X = gas.drop('thousands_barrels', axis=1).iloc[split_test:, ].as_matrix()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "ff9d10f9-b611-423b-80da-6dcdafd1c8b9"
    }
   },
   "source": [
    "Now, we'll convert the datasets to the recordIO-wrapped protobuf format used by the Amazon SageMaker algorithms and upload this data to S3.  We'll start with training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "buf = io.BytesIO()\n",
    "smac.write_numpy_to_dense_tensor(buf, np.array(train_X).astype('float32'), np.array(train_y).astype('float32'))\n",
    "buf.seek(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "key = 'linear_train.data'\n",
    "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)\n",
    "s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)\n",
    "print('uploaded training data location: {}'.format(s3_train_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll convert and upload the validation dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "buf = io.BytesIO()\n",
    "smac.write_numpy_to_dense_tensor(buf, np.array(validation_X).astype('float32'), np.array(validation_y).astype('float32'))\n",
    "buf.seek(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "key = 'linear_validation.data'\n",
    "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', key)).upload_fileobj(buf)\n",
    "s3_validation_data = 's3://{}/{}/validation/{}'.format(bucket, prefix, key)\n",
    "print('uploaded validation data location: {}'.format(s3_validation_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "f3b125ad-a2d5-464c-8cfa-bd203034eee4"
    }
   },
   "source": [
    "---\n",
    "## Train\n",
    "\n",
    "Now we can begin to specify our linear model.  First, let's specify the containers for the Linear Learner algorithm.  Since we want this notebook to run in all of Amazon SageMaker's regions, we'll use a convenience function to look up the container image name for our current region.  More details on algorithm containers can be found in [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "container = get_image_uri(boto3.Session().region_name, 'linear-learner')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Amazon SageMaker's Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit.  This functionality is automatically enabled.  We can influence this using parameters like:\n",
    "\n",
    "- `num_models` to increase to total number of models run.  The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a solution nearby that may be more optimal.  In this case, we're going to use the max of 32.\n",
    "- `loss` which controls how we penalize mistakes in our model estimates.  For this case, let's use absolute loss as we haven't spent much time cleaning the data, and absolute loss will adjust less to accomodate outliers.\n",
    "- `wd` or `l1` which control regularization.  Regularization can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalizability.  In this case, we'll leave these parameters as their default \"auto\" though.\n",
    "\n",
    "Let'd kick off our training job in SageMaker's distributed, managed training.  Because training is managed (AWS handles spinning up and spinning down hardware), we don't have to wait for our job to finish to continue, but for this case, we'll use the Python SDK to track to wait and track our progress."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sess = sagemaker.Session()\n",
    "\n",
    "linear = sagemaker.estimator.Estimator(container,\n",
    "                                       role, \n",
    "                                       train_instance_count=1, \n",
    "                                       train_instance_type='ml.c4.xlarge',\n",
    "                                       output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
    "                                       sagemaker_session=sess)\n",
    "linear.set_hyperparameters(feature_dim=59,\n",
    "                           mini_batch_size=100,\n",
    "                           predictor_type='regressor',\n",
    "                           epochs=10,\n",
    "                           num_models=32,\n",
    "                           loss='absolute_loss')\n",
    "\n",
    "linear.fit({'train': s3_train_data, 'validation': s3_validation_data})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "2adcc348-9ab5-4a8a-8139-d0ecd740208a"
    }
   },
   "source": [
    "---\n",
    "## Host\n",
    "\n",
    "Now that we've trained the linear algorithm on our data, let's create a model and deploy that to a hosted endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "linear_predictor = linear.deploy(initial_instance_count=1,\n",
    "                                 instance_type='ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Forecast\n",
    "\n",
    "Now that we have our hosted endpoint, we can generate statistical forecasts from it.  Let's forecast on our test dataset to understand how accurate our model may be.\n",
    "\n",
    "There are many metrics to measure forecast error.  Common examples include include:\n",
    "- Root Mean Square Error (RMSE)\n",
    "- Mean Absolute Percent Error (MAPE)\n",
    "- Geometric Mean of the Relative Absolute Error (GMRAE)\n",
    "- Quantile forecast errors\n",
    "- Errors that account for asymmetric loss in over or under-prediction\n",
    "\n",
    "For our example we'll keep things simple and use Median Absolute Percent Error (MdAPE), but we'll also compare it to a naive benchmark forecast (that week last year's demand * that week last year / that week two year's ago).\n",
    "\n",
    "There are also multiple ways to generate forecasts.\n",
    "- One-step-ahead forecasts:  When predicting for multiple data points, one-step-ahead forecasts update the history with the correct known value.  These are common, easy to produce, and can give us some intuition of whether out model is performing as expected.  However, they can also present an unnecessarily optimistic evaluation of the forecast.  In most real-life cases, we want to predict out well into the future, because the actions we may take based on that forecast are not immediate.  In these cases, we want know what the time-periods in between will bring, so generating a forecast based on the knowledge that we do, can be misleading.\n",
    "- Multi-step-ahead (or horizon) forecasts: In this case, when forecasting out of sample, each forecast builds off of the forecasted periods that precede it.  So, errors early on in the test data can compound to create large deviations for observations late in the test data.  Although this is more realistic, it can be difficult to create the forecasts, particularly as model complexity increases.\n",
    "\n",
    "For our example, we'll calculate both, but focus on the multi-step forecast accuracy.\n",
    "\n",
    "Let's start by generating the naive forecast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "gas['thousands_barrels_lag52'] = gas['thousands_barrels'].shift(52)\n",
    "gas['thousands_barrels_lag104'] = gas['thousands_barrels'].shift(104)\n",
    "gas['thousands_barrels_naive_forecast'] = gas['thousands_barrels_lag52'] ** 2 / gas['thousands_barrels_lag104']\n",
    "naive = gas[split_test:]['thousands_barrels_naive_forecast'].as_matrix()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And investigating it's accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print('Naive MdAPE =', np.median(np.abs(test_y - naive) / test_y))\n",
    "plt.plot(np.array(test_y), label='actual')\n",
    "plt.plot(naive, label='naive')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll generate the one-step-ahead forecast.  First we need a function to convert our numpy arrays into a format that can be handled by the HTTP POST request we pass to the inference container.  In this case that's a simple CSV string.  The results will be published back as JSON.  For these common formats we can use the Amazon SageMaker Python SDK's built in `csv_serializer` and `json_deserializer` functions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "linear_predictor.content_type = 'text/csv'\n",
    "linear_predictor.serializer = csv_serializer\n",
    "linear_predictor.deserializer = json_deserializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we'll invoke the endpoint to get predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = linear_predictor.predict(test_X)\n",
    "one_step = np.array([r['score'] for r in result['predictions']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare forecast errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print('One-step-ahead MdAPE = ', np.median(np.abs(test_y - one_step) / test_y))\n",
    "plt.plot(np.array(test_y), label='actual')\n",
    "plt.plot(one_step, label='forecast')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see our MdAPE is substantially better than the naive, and we actually swing from a forecasts that's too volatile to one that under-represents the noise in our data.  However, the overall shape of the statistical forecast does appear to better represent the actual data.\n",
    "\n",
    "Next, let's generate multi-step-ahead forecast.  To do this, we'll need to loop over invoking the endpoint one row at a time and make sure the lags in our model are updated appropriately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "multi_step = []\n",
    "lags = test_X[0, 0:4]\n",
    "for row in test_X:\n",
    "    row[0:4] = lags\n",
    "    result = linear_predictor.predict(row)\n",
    "    prediction = result['predictions'][0]['score']\n",
    "    multi_step.append(prediction)\n",
    "    lags[1:4] = lags[0:3]\n",
    "    lags[0] = prediction\n",
    "\n",
    "multi_step = np.array(multi_step)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now calculate the accuracy of these predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print('Multi-step-ahead MdAPE =', np.median(np.abs(test_y - multi_step) / test_y))\n",
    "plt.plot(np.array(test_y), label='actual')\n",
    "plt.plot(multi_step, label='forecast')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see our multi-step ahead error performs worse than our one-step ahead forecast, but nevertheless remains substantially stronger than the naive benchmark forecast.  This 1.5 percentage point difference may not seem particularly meaningful, but at the large scale of many topline forecasts can mean millions of dollars in excess supply or lost sales."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Extensions\n",
    "\n",
    "Our linear model does a good job of predicting gasoline demand, but of course, improvements could be made.  The fact that statistical forecast actually underrepresents some of the volatility in the data could suggest that we have actually over-regularized the data.  Or, perhaps our choice of absolute loss was incorrect.  Rerunning the model with further tweaks to these hyperparameters may provide more accurate out of sample forecasts.  We also did not do a large amount of feature engineering.  Occasionally, the lagging time-periods have complex interrelationships with one another that should be explored.  Finally, alternative forecasting algorithms could be explored.  Less interpretable methods like ARIMA, and black-box methods like LSTM Recurrent Neural Networks have been shown to predict time-series very well.  Balancing the simplicity of a linear model with predictive accuracy is an important subjective question where the right answer depends on the problem being solved, and its implications to the business."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Clean-up\n",
    "\n",
    "If you're ready to be done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sagemaker.Session().delete_endpoint(linear_predictor.endpoint)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Environment (conda_python3)",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
